Abstract

Topic Modeling is an approach used for automatic comprehension and classification of data in 
a variety of settings, and perhaps the canonical application is in uncovering thematic structure 
in a corpus of documents. A number of foundational works both in machine learning [15] and 
in theory [27] have suggested a probabilistic model for documents, whereby documents arise as 
a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector 
being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been
used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et 
al. is especially popular. 

Theoretical studies of topic modeling focus on learning the model's parameters assuming the 
data is actually generated from it. Existing approaches for the most part rely on Singular Value 
Decomposition (SVD), and consequently have one of two limitations: these works need to either 
assume that each document contains only one topic, or else can only recover the span of the 
topic vectors instead of the topic vectors themselves. 

This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in 
this context, which is an analog of SVD where all vectors are nonnegative. Using this tool 
we give the first polynomial-time algorithm for learning topic models without the above two 
limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix 
called separability, which is usually found to hold in real-life data. A compelling feature of our 
algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the 
Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). 

We hope that this paper will motivate further theoretical results that use NMF as a replacement
for SVD - just as NMF has come to replace SVD in many applications.

1 Introduction



Developing tools for automatic comprehension and classification of data — web pages, newspaper 
articles, images, genetic sequences, user ratings — is a holy grail of machine learning. Topic 
Modeling is an approach that has proved successful in all of the aforementioned settings, though 
for concreteness here we will focus on uncovering thematic structure of a corpus of documents (see 
e.g. [4], [6]). 

In order to learn structure one has to posit the existence of structure, and in topic models one 
assumes a generative model for a collection of documents. Specifically, each document is represented 
as a vector of word-frequencies (the bag of words representation). Seminal papers in theoretical
CS (Papadimitriou et al. [27]) and machine learning (Hofmann's Probabilistic Latent Semantic
Analysis [15]) suggested that documents arise as a convex combination of (i.e. distribution on) a
small number of topic vectors, where each topic vector is a distribution on words (i.e. a vector of
word-frequencies). Each convex combination of topics thus is itself a distribution on words, and the
document is assumed to be generated by drawing N independent samples from it. Subsequent work
makes specific choices for the distribution used to generate topic combinations; the well-known
Latent Dirichlet Allocation (LDA) model of Blei et al. [6] hypothesizes a Dirichlet distribution (see
Section 4).
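
The generative process just described can be sketched in a few lines. This is only an illustration under stated assumptions: the 4-word, 2-topic matrix `A`, the Dirichlet parameters `alpha`, and the helper name `sample_document` are hypothetical and do not come from the paper.

```python
import numpy as np

def sample_document(A, alpha, N, rng):
    """Draw one bag-of-words document of N words from the topic model.

    A     : (n, r) topic matrix; each column is a distribution over n words.
    alpha : (r,) Dirichlet parameters (the LDA choice for topic weights).
    """
    w = rng.dirichlet(alpha)                  # topic proportions of this document
    word_dist = A @ w                         # convex combination of topic vectors
    word_dist = word_dist / word_dist.sum()   # guard against rounding drift
    counts = rng.multinomial(N, word_dist)    # N independent word draws
    return counts, w

# Hypothetical toy model: 4 words, 2 topics, columns of A sum to 1.
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.0],
              [0.0, 0.4],
              [0.3, 0.2],
              [0.2, 0.4]])
counts, w = sample_document(A, alpha=np.array([0.5, 0.5]), N=100, rng=rng)
```

Here `counts` is the bag-of-words vector of one document and `w` is the hidden column of W that generated it.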

In machine learning, the prevailing approach is to use local search (e.g. [11]) or other heuristics 
[32] in an attempt to find a maximum likelihood fit to the above model. For example, fitting to a corpus
of newspaper articles may reveal 50 topic vectors corresponding to, say, politics, sports, weather,
entertainment etc., and a particular article could be explained as a (1/2, 1/3, 1/6)-combination of
the topics politics, sports, and entertainment. Unfortunately (and not surprisingly), the maximum
likelihood estimation is NP-hard (see Section 6) and consequently when using this paradigm, it
seems necessary to rely on unproven heuristics even though these have well-known limitations (e.g.
getting stuck in a local minimum [11, 28]).

The work of Papadimitriou et al. [27] (which also formalized the topic modeling problem) and
a long line of subsequent work have attempted to give provable guarantees for the problem of 
learning the model parameters assuming the data is actually generated from it. This is in contrast 
to a maximum likelihood approach, which asks to find the closest-fit model for arbitrary data. The 
principal algorithmic problem is the following (see Section 1.1 for more details): 

Meta Problem in Topic Modeling: There is an unknown topic matrix $A$ with
nonnegative entries that is of dimension $n \times r$, and a stochastically generated unknown
matrix $W$ that is of dimension $r \times m$. Each column of $AW$ is viewed as a probability
distribution on rows, and for each column we are given $N \ll n$ i.i.d. samples from the
associated distribution.

Goal: Reconstruct A and parameters of the generating distribution for W . 

The challenging aspect of this problem is that we wish to recover nonnegative matrices A, W 
with small inner-dimension r. The general problem of finding nonnegative factors A, W of specified 
dimensions when given the matrix AW (or a close approximation) is called the Nonnegative Matrix 
Factorization (NMF) problem (see [22], and [2] for a longer history) and it is NP-hard [31]. Lacking 
a tool to solve such problems, theoretical work has generally relied on the Singular Value Decomposition
(SVD) which given the matrix AW will instead find factors U, V with both positive and
negative entries. SVD can be used as a tool for clustering - in which case one needs to assume that
each document has only one topic. In Papadimitriou et al. [27] this is called the pure documents
case and is solved under strong assumptions about the topic matrix A (see also [26] and [1] which
uses the method of moments instead). Alternatively, other papers use SVD to recover the span of
the columns of A (i.e. the topic vectors) [3], [20], [19], which suffices for some applications such as 
computing the inner product of two document vectors (in the space spanned by the topics) as a 
measure of their similarity. 

These limitations of existing approaches — either restricting to one topic per document, or else 
learning only the span of the topics instead of the topics themselves — are quite serious. In practice 
documents are much more faithfully described as a distribution on topics and indeed for a wide 
range of applications one needs the actual topics and not just their span - such as when browsing 
a collection of documents without a particular query phrase in mind, or tracking how topics evolve 
over time (see [4] for a survey of various applications). Here we consider what we believe to be 
a much weaker assumption - separability. Indeed, this property has already been identified as a 
natural one in the machine learning community [12] and has been empirically observed to hold in 
topic matrices fitted to various types of data [5]. 

Separability requires that each topic has some near-perfect indicator word - a word that we 
call the anchor word for this topic — that appears with reasonable probability in that topic but 
with negligible probability in all other topics (e.g., "soccer" could be an anchor word for the topic 
"sports"). We give a formal definition in Section 1.1. This property is particularly natural in 
the context of topic modeling, where the number of distinct words (dictionary size) is very large 
compared to the number of topics. In a typical application, it is common to have a dictionary size 
in the thousands or tens of thousands, but the number of topics is usually somewhere in the range 
from 50 to 100. Note that separability does not mean that the anchor word always occurs (in fact, 
a typical document may be very likely to contain no anchor words). Instead, it dictates that when 
an anchor word does occur, it is a strong indicator that the corresponding topic is in the mixture 
used to generate the document. 

Recently, we gave a polynomial time algorithm to solve NMF under the condition that the topic 
matrix A is separable [2] . The intuition that underlies this algorithm is that the set of anchor words 
can be thought of as extreme points (in a geometric sense) of the dictionary. This condition can 
be used to identify all of the anchor words and then also the nonnegative factors. Ideas from this 
algorithm are a key ingredient in our present paper, but our focus is on the question: 

Question. What if we are not given the true matrix AW , but are instead given a few samples (say, 
100 samples) from the distribution represented by each column? 

The main technical challenge in adapting our earlier NMF algorithm is that each document 
vector is a very poor approximation to the corresponding column of AW — it is too noisy in any 
reasonable measure of noise. Nevertheless, the core insights of our NMF algorithm still apply. 
Note that it is impossible to learn the matrix W to within arbitrary accuracy. (Indeed, this is 
information theoretically impossible even if we knew the topic matrix A and the distribution from 
which the columns of W are generated.) So we cannot in general give an estimator that converges 
to the true matrix W, and yet we can give an estimator that converges to the true topic matrix A.
(For an overview of our algorithm, see the first paragraph of Section 3.) 

We hope that this application of our NMF algorithm is just a starting point and other theoretical 
results will start using NMF as a replacement for SVD - just as NMF has come to replace SVD in 
several applied settings. 

1.1 Our Results 

Now we precisely define the topic modeling (learning) problem which was informally introduced 
above. There is an unknown topic matrix $A$ of dimension $n \times r$ (i.e. $n$ is the dictionary size)
and each column of $A$ is a distribution on $[n]$. There is an unknown $r \times m$ matrix $W$, each of whose
columns is itself a distribution (aka convex combination) on $[r]$. The columns of $W$ are i.i.d. samples
from a distribution T which belongs to a known family, e.g., Dirichlet distributions, but whose 
parameters are unknown. Thus each column of AW (being a convex combination of distributions) 
is itself a distribution on [n], and the algorithm's input consists of N i.i.d. samples for each column 
of AW. Here N is the document size and is assumed to be a constant for simplicity. Our algorithm 
can be easily adapted to work when the documents have different sizes. 

The algorithm's running time will necessarily depend upon various model parameters, since 
distinguishing a very small parameter from 0 imposes a lower bound on the number of samples
needed. The first such parameter is a quantitative version of separability, which was presented 
above as a natural assumption in context of topic modeling. 

Definition 1.1 ($p$-Separable Topic Matrix). An $n \times r$ matrix $A$ is $p$-separable if for each $i$ there is
some row $\pi(i)$ of $A$ that has a single nonzero entry which is in the $i$th column and it is at least $p$.
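
As a concrete reading of Definition 1.1, the following sketch tests a small matrix for p-separability by brute force. The function name `anchor_rows` and the example matrix are assumptions made for illustration; the paper's algorithms never have direct access to A like this.

```python
import numpy as np

def anchor_rows(A, p):
    """Return one anchor row index per column if A is p-separable, else None.

    A row is an anchor for column i if its only nonzero entry is in
    column i and that entry is at least p.
    """
    n, r = A.shape
    anchors = []
    for i in range(r):
        found = None
        for row in range(n):
            others = np.delete(A[row], i)          # entries outside column i
            if A[row, i] >= p and np.all(others == 0):
                found = row
                break
        if found is None:
            return None                            # column i has no anchor row
        anchors.append(found)
    return anchors

# Hypothetical 4-word, 2-topic matrix: rows 0 and 1 are anchor words.
A = np.array([[0.5, 0.0],
              [0.0, 0.4],
              [0.3, 0.2],
              [0.2, 0.4]])
```

With this matrix, `anchor_rows(A, 0.4)` finds the two anchor rows, while a larger threshold such as 0.45 fails because the second topic's anchor entry is only 0.4.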

The next parameter measures the lowest probability with which a topic occurs in the distribution 
that generates columns of W. 

Definition 1.2 (Topic Imbalance). The topic imbalance of the model is the ratio between the
largest and smallest expected entries in a column of $W$, in other words,
$a = \max_{i,j \in [r]} \frac{\mathbf{E}[X_i]}{\mathbf{E}[X_j]}$, where
$X \in \mathbb{R}^r$ is a random weighting of topics chosen from the distribution.

Finally, we require that topics stay identifiable despite sampling-induced noise. To formalize 
this, we define a matrix that will be important throughout this paper: 

Definition 1.3 (Topic-Topic Covariance Matrix $R(\mathcal{T})$). If $\mathcal{T}$ is the distribution that generates the
columns of $W$, then $R(\mathcal{T})$ is defined as an $r \times r$ matrix whose $(i,j)$th entry is $\mathbf{E}[X_i X_j]$ where
$X = (X_1, X_2, \ldots, X_r)$ is a vector chosen from $\mathcal{T}$.
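
For intuition, the two quantities just defined can be computed in closed form when the topic weights are Dirichlet, using the standard second-moment formula for the Dirichlet distribution. The sketch below is an illustration only: the function names and the parameter vector `alpha` are hypothetical, and the Monte Carlo comparison simply checks the formula against the empirical matrix (1/m) W W^T.

```python
import numpy as np

def dirichlet_second_moments(alpha):
    """Exact matrix of E[X_i X_j] for X ~ Dirichlet(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    # E[X_i X_j] = alpha_i alpha_j / (a0 (a0 + 1)) for i != j,
    # E[X_i^2]   = alpha_i (alpha_i + 1) / (a0 (a0 + 1)).
    return (np.outer(alpha, alpha) + np.diag(alpha)) / (a0 * (a0 + 1.0))

def topic_imbalance(alpha):
    """Ratio of largest to smallest E[X_i]; for Dirichlet, E[X_i] = alpha_i / a0."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha.max() / alpha.min()

alpha = np.array([0.3, 0.6, 0.9])              # hypothetical parameters
R_exact = dirichlet_second_moments(alpha)

rng = np.random.default_rng(1)
W = rng.dirichlet(alpha, size=200_000).T       # r x m, one column per document
R_emp = W @ W.T / W.shape[1]                   # empirical (1/m) W W^T
```

Since the entries of X sum to 1, the entries of this matrix also sum to 1, which is a quick sanity check on any estimate of it.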

Let $\gamma > 0$ be a lower bound on the $\ell_1$-condition number of the matrix $R(\mathcal{T})$. This is defined
in Section 2, but for an $r \times r$ matrix it is within a factor of $\sqrt{r}$ of the smallest singular value. Our
algorithm will work for any $\gamma$, but the number of documents we require will depend (polynomially)
on $1/\gamma$:

Theorem 1.4 (Main). There is a polynomial time algorithm that learns the parameters of a topic 
model if the number of documents is at least 



where the three numbers $a, p, \gamma$ are as defined above. The algorithm learns the topic-term matrix $A$
up to additive error $\epsilon$. Moreover, when the number of documents is also larger than
$O\left(\frac{\log r \cdot r^2}{\epsilon^2}\right)$ the
algorithm can learn the topic-topic covariance matrix $R(\mathcal{T})$ up to additive error $\epsilon$.

As noted earlier, we are able to recover the topic matrix even though we do not always recover the 
parameters of the column distribution T. In some special cases we can also recover the parameters 
of T, e.g. when this distribution is Dirichlet, as happens in the popular Latent Dirichlet Allocation 
(LDA) model [6, 4]. In Section 4.1 we compute a lower bound on the $\gamma$ parameter for the Dirichlet
distribution, which allows us to apply our main learning algorithm, and also the parameters of T
can be recovered from the covariance matrix $R(\mathcal{T})$ (see Section 4.2).

Recently the basic LDA model has been refined to allow correlation among different topics,
which is more realistic. See for example the Correlated Topic Model (CTM) [7] and the Pachinko
Allocation Model (PAM) [24]. A compelling aspect of our algorithm is that it extends to these
models as well: we can learn the topic matrix, even though we cannot always identify T. (Indeed,
the distribution T in the Pachinko Allocation Model is not even identifiable: two different sets of
parameters can generate exactly the same distribution.)

Comparison with existing approaches. (i) We rely crucially on separability. But note that
this assumption is weaker in some sense than the assumptions in all prior works that provably 
learn the topic matrix. They assume a single topic per document, which can be seen as a strong 
separability assumption about W instead of A — in every column of W only one entry is nonzero. 
By contrast, separability only assumes a similar condition for a negligible fraction — namely, r out 
of n — of rows of A. Besides, it is found to actually hold in topic matrices found using current 
heuristics. (ii) Needless to say, existing theoretical approaches for recovering topic matrix A couldn't
handle topic correlations at all since they only allow one topic per document. (iii) We remark that
prior approaches that learn the span of A instead of A needed strong concentration bounds on 
eigenvalues of random matrices, and thus require substantial document sizes (on the order of the 
number of words in the dictionary!). By contrast we can work with documents of O(1) size.

2 Tools for (Noisy) Nonnegative Matrix Factorization 
2.1 Various Condition Numbers 

Central to our arguments will be various notions of matrices being "far" from being low-rank. The 
most interesting one for our purposes was introduced by Kleinberg and Sandler [19] in the context 
of collaborative filtering, and can be thought of as an $\ell_1$-analogue to the smallest singular value of
a matrix. 

Definition 2.1 ($\ell_1$ Condition Number). If matrix $B$ has nonnegative entries and all rows sum to
1 then its $\ell_1$ Condition Number $\Gamma(B)$ is defined as:

$$\Gamma(B) = \min_{\|x\|_1 = 1} \|xB\|_1.$$

If $B$ does not have row sums of one then $\Gamma(B)$ is equal to $\Gamma(DB)$ where $D$ is the diagonal matrix
such that $DB$ has row sums of one.

For example, if the rows of $B$ have disjoint support then $\Gamma(B) = 1$ and in general the quantity
$\Gamma(B)$ can be thought of as a measure of how close two distributions on disjoint sets of rows can be.
Note that, if $x$ is an $n$-dimensional real vector, $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$ and hence (if $\sigma_{\min}(B)$ is
the smallest singular value of $B$), we have:

$$\frac{1}{\sqrt{n}}\,\sigma_{\min}(B) \le \Gamma(B) \le \sqrt{m}\,\sigma_{\min}(B).$$
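
The definition can be probed numerically. The sketch below is not an exact computation of the $\ell_1$ condition number: it uses random search over the $\ell_1$ unit sphere, which can only over-estimate the minimum, so it yields an upper bound. The function name, the number of trials, and the example matrix are all assumptions made for illustration.

```python
import numpy as np

def l1_condition_upper_bound(B, trials=20_000, seed=0):
    """Upper bound on Gamma(B) = min_{||x||_1 = 1} ||x B||_1 by random search.

    Rows of B are first rescaled to sum to 1, as in the definition.
    Random search can only over-estimate the true minimum.
    """
    B = np.asarray(B, dtype=float)
    B = B / B.sum(axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((trials, B.shape[0]))
    X /= np.abs(X).sum(axis=1, keepdims=True)      # place each x on the l1 sphere
    return np.abs(X @ B).sum(axis=1).min()

# Rows with disjoint support: every unit x gives ||x B||_1 = 1.
B_disjoint = np.array([[0.5, 0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.3, 0.7]])
gamma_disjoint = l1_condition_upper_bound(B_disjoint)
```

For the disjoint-support example the estimate equals 1 up to rounding, matching the remark above. For a general matrix the returned value must still lie above the singular-value lower bound, which gives a cheap consistency check.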

The above notions of condition number will be most relevant in the context of the topic-topic
covariance matrix $R(\mathcal{T})$. We shall always use $\gamma$ to denote the $\ell_1$ condition number of $R(\mathcal{T})$. The
definition of condition number will be preserved even when we estimate the topic-topic covariance 
matrix using random samples. 



Lemma 2.2. When $m > 5\log r/\epsilon_0^2$, with high probability the matrix $R = \frac{1}{m}WW^T$ is entry-wise
close to $R(\mathcal{T})$ with error $\epsilon_0$. Further, when $\epsilon_0 < 1/4\gamma a r^2$ where $a$ is topic imbalance, the matrix $R$
has $\ell_1$ condition number at least $\gamma/2$.

Proof: Since $\mathbf{E}[W_i W_i^T] = R(\mathcal{T})$, the first part is just by Chernoff bound and union bound. The
further part follows because $R(\mathcal{T})$ is $\gamma$ robustly simplicial, and the error can change the $\ell_1$ norm of
$vR$ for any unit $v$ by at most $ar \cdot r\epsilon_0$. The extra factor $ar$ comes from the normalization to make
rows of $R$ sum up to 1. ■

In our previous work on nonnegative matrix factorization [2] we defined a different measure of 
"distance" from singular which is essential to the polynomial time algorithm for NMF: 

Definition 2.3 ($\beta$-robustly simplicial). If each column of a matrix $A$ has unit $\ell_1$ norm, then we
say it is $\beta$-robustly simplicial if no column in $A$ has $\ell_1$ distance smaller than $\beta$ to the convex hull
of the remaining columns in $A$.

The following claim clarifies the interrelationships of these latter condition numbers. 

Claim 2.4. (i) If $A$ is $p$-separable then $A^T$ has $\ell_1$ condition number at least $p$. (ii) If $A^T$ has all
row sums equal to 1 then $A$ is $\beta$-robustly simplicial for $\beta = \Gamma(A^T)/2$.

We shall see that the $\ell_1$ condition number for a product of matrices is at least the product of the $\ell_1$
condition numbers. The main application of this composition is to show that the matrix $R(\mathcal{T})A^T$
(or the empirical version $RA^T$) is at least $\Omega(\gamma p)$-robustly simplicial. The following lemma will play
a crucial role in analyzing our main algorithm:

Lemma 2.5 (Composition Lemma). If $B$ and $C$ are matrices with $\ell_1$ condition number $\Gamma(B) \ge \gamma$
and $\Gamma(C) \ge \beta$, then $\Gamma(BC)$ is at least $\beta\gamma$. Specifically, when $A$ is $p$-separable the matrix $R(\mathcal{T})A^T$
is at least $\gamma p/2$-robustly simplicial.

Proof: The proof is straightforward because for any vector $x$, we know $\|xBC\|_1 \ge \Gamma(C)\,\|xB\|_1 \ge
\Gamma(C)\Gamma(B)\,\|x\|_1$. For the matrix $R(\mathcal{T})A^T$, by Claim 2.4 we know the matrix $A^T$ has $\ell_1$ condition
number at least $p$. Hence $\Gamma(R(\mathcal{T})A^T)$ is at least $\gamma p$ and again by Claim 2.4 the matrix is $\gamma p/2$-robustly
simplicial. ■



2.2 Noisy Nonnegative Matrix Factorization under Separability 

A key ingredient is an approximate NMF algorithm from [2], which can recover an approximate 
nonnegative matrix factorization $M \approx AW$ when the $\ell_1$ distance between each row of $M$ and the
corresponding row in $AW$ is small. We emphasize that this is not enough for our purposes, since
the term-by-document matrix $M$ will have a substantial amount of noise (when compared to its
expectation) precisely because the number of words in a document $N$ is much smaller than the
dictionary size $n$. Rather, we will apply the following algorithm (and an improvement that we give
in Section 5) to the Gram matrix $MM^T$.

Theorem 2.6 (Robust NMF Algorithm [2]). Suppose $M = AW$ where $W$ and $M$ are normalized
to have rows sum up to 1, $A$ is separable and $W$ is $\gamma$-robustly simplicial. Let $\epsilon = O(\gamma^2)$. There is
a polynomial time algorithm that given $\tilde{M}$ such that for all rows $\|\tilde{M}^i - M^i\|_1 < \epsilon$, finds a $W'$ such
that $\|W'^i - W^i\|_1 < 10\epsilon/\gamma + 7\epsilon$. Further every row $W'^i$ in $W'$ is a row in $\tilde{M}$. The corresponding
row in $M$ can be represented as $(1 - O(\epsilon/\gamma^2))W^i + O(\epsilon/\gamma^2)W^{-i}$. Here $W^{-i}$ is a vector in the
convex hull of other rows in $W$ with unit length in $\ell_1$ norm.

In this paper our goal is slightly different from the one in [2]. Our goal is not to recover
estimates to the anchor words that are close in $\ell_1$-norm but rather to recover almost anchor words
(words whose row in $A$ has almost all its weight on a single coordinate). Hence, we will be able
to achieve better bounds by treating this problem directly, and we give a substitute for the above
theorem. We defer the proof to Section 5.

Theorem 2.7. Suppose $M = AW$ where $W$ and $M$ are normalized to have rows sum up to 1, $A$
is separable and $W$ is $\gamma$-robustly simplicial. When $\epsilon < \gamma/100$ there is a polynomial time algorithm
that given $\tilde{M}$ such that for all rows $\|\tilde{M}^i - M^i\|_1 < \epsilon$, finds $r$ rows (almost anchor words) in $\tilde{M}$. The
$i$-th almost anchor word corresponds to a row in $M$ that can be represented as $(1 - O(\epsilon/\gamma))W^i +
O(\epsilon/\gamma)W^{-i}$. Here $W^{-i}$ is a vector in the convex hull of other rows in $W$ with unit length in $\ell_1$
norm.

3 Algorithm for Learning a Topic Model: Proof of Theorem 1.4 

First it is important to understand why separability helps in nonnegative matrix factorization, and 
specifically, the exact role played by the anchor words. Suppose the NMF algorithm is given a 
matrix AB. If A is p-separable then this means that A contains a diagonal matrix (up to row 
permutations). Thus a scaled copy of each row of B is present as a row in AB. In fact, if we 
knew the anchor words of A, then by looking at the corresponding rows of AB we could "read off" 
the corresponding row of B (up to scaling), and use these in turn to recover all of A. Thus the 
anchor words constitute the "key" that "unlocks" the factorization, and indeed the main step of 
our earlier NMF algorithm was a geometric procedure to identify the anchor words. When one is 
given a noisy version of AB, the analogous notion is "almost anchor" words, which correspond to 
rows of AB that are "very close" to rows of B; see Theorem 2.7. 
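
The "read off" step in the paragraph above is easy to see on a toy instance. The following sketch uses made-up matrices (the diagonal values, the shapes, and the seed are assumptions for illustration) and places the anchor words in the first r rows of A.

```python
import numpy as np

rng = np.random.default_rng(2)
r, n, k = 3, 7, 5
D = np.diag([0.2, 0.3, 0.25])        # anchor rows: a single nonzero entry each
U = rng.random((n - r, r))           # remaining (non-anchor) rows of A
A = np.vstack([D, U])                # separable; anchors are rows 0..r-1
B = rng.random((r, k))

AB = A @ B
# Row i of AB (for i < r) equals D[i, i] * B[i]: a scaled copy of row i of B.
scaled_rows = AB[:r] / np.diag(D)[:, None]
```

Dividing each anchor row of AB by the corresponding diagonal entry returns B exactly. In the actual problem that scaling is unknown, which is why the recovery procedure later solves a linear system for it.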

Now we sketch how to apply these insights to learning topic models. Let M denote the provided 
term-by-document matrix, whose each column describes the empirical word frequencies in the 
documents. It is obtained from sampling AW and thus is an extremely noisy approximation to 
AW. Our algorithm starts by forming the Gram matrix $MM^T$, which can be thought of as an
empirical word-word covariance matrix. In fact as the number of documents increases $\frac{1}{m}MM^T$
tends to a limit $Q = \frac{1}{m}\mathbf{E}[AWW^TA^T]$, implying $Q = AR(\mathcal{T})A^T$. (See Lemma 3.7.) Imagine that we
are given the exact matrix $Q$ instead of a noisy approximation. Notice that $Q$ is a product of three
nonnegative matrices, the first of which is $p$-separable and the last is the transpose of the first.
NMF at first sight seems too weak to help find such factorizations. However, if we think of $Q$ as
a product of two nonnegative matrices, $A$ and $R(\mathcal{T})A^T$, then our NMF algorithm [2] can at least
identify the anchor words of $A$. As noted above, these suffice to recover $R(\mathcal{T})A^T$, and then (using
the anchor words of $A$ again) all of $A$ as well. See Section 3.1 for details.

Of course, we are not given Q but merely a good approximation to it. Now our NMF algorithm 
allows us to recover "almost anchor" words of A, and the crux of the proof is Section 3.2 showing that 
these suffice to recover provably good estimates to $A$ and $WW^T$. This uses (mostly) bounds from
matrix perturbation theory, and interrelationships of condition numbers mentioned in Section 2. 

For simplicity we assume the following condition on the topic model, which we will see in 
Section 3.5 can be assumed without loss of generality: 
(*) The number of words, $n$, is at most $4ar/\epsilon$.
Please see Algorithm 1: Main Algorithm for description of the algorithm. Note that $R$ is our
shorthand for $\frac{1}{m}WW^T$, which as noted converges to $R(\mathcal{T})$ as the number of documents increases.

Algorithm 1. Main Algorithm, Output: R and A



1. Query the oracle for m documents, where 



2. Split the words of each document into two halves, and let M, M' be the term-by-document matrix 
with first and second half of words respectively. 

3. Compute word-by-word matrix $Q = \frac{4}{N^2 m}MM'^T$

4. Apply the "Robust NMF" algorithm of Theorem 2.7 to Q which returns r words that are "almost" 
the anchor words of A. 

5. Use these r words as input to Recover with Almost Anchor Words to compute $R = \frac{1}{m}WW^T$
and $A$
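
Steps 2 and 3 of Algorithm 1 can be simulated on synthetic data. The sketch below rests on several assumptions that are not taken from the paper: M and M' are treated as word-count matrices of the two halves (so a factor 4/(N^2 m) converts each half of N/2 words into frequencies and averages over m documents), the two halves are drawn as independent batches of N/2 words, and the toy matrix `A`, the Dirichlet parameters, and the function name are hypothetical.

```python
import numpy as np

def word_cooccurrence(M, M_prime, N):
    """Q = 4 / (N^2 m) * M M'^T for n x m count matrices of the two halves."""
    m = M.shape[1]
    return 4.0 / (N ** 2 * m) * (M @ M_prime.T)

rng = np.random.default_rng(3)
A = np.array([[0.4, 0.0],
              [0.0, 0.5],
              [0.3, 0.2],
              [0.3, 0.3]])                     # columns sum to 1
alpha = np.array([0.5, 0.5])
N, m = 20, 20_000

W = rng.dirichlet(alpha, size=m).T             # r x m topic weights
P = A @ W                                      # n x m word distributions
P /= P.sum(axis=0, keepdims=True)              # guard against rounding drift
# Two independent halves of N/2 words per document.
M = np.stack([rng.multinomial(N // 2, P[:, d]) for d in range(m)], axis=1)
M_prime = np.stack([rng.multinomial(N // 2, P[:, d]) for d in range(m)], axis=1)

Q = word_cooccurrence(M, M_prime, N)
Q_target = A @ (W @ W.T / m) @ A.T             # A R A^T with R = (1/m) W W^T
```

Because the two halves are independent given the topic weights, the product of their frequencies is an unbiased estimate of the corresponding entry of A R A^T, including on the diagonal. Using a single matrix M for both factors would add a sampling bias to the diagonal entries.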



Algorithm 2. Recover with True Anchor Words 
Input: r anchor words, Output: R and A 

1. Permute the rows and columns of Q so that the anchor words appear in the first r rows and columns 

2. Compute $DRA^T\mathbf{1}$ (which is equal to $DR\mathbf{1}$)

3. Solve for $z$: $DRDz = DR\mathbf{1}$.

4. Output $A^T = (DRD\,\mathrm{Diag}(z))^{-1}DRA^T$.

5. Output $R = \mathrm{Diag}(z)\,DRD\,\mathrm{Diag}(z)$.



3.1 Recover R and A with Anchor Words 

We first describe how the recovery procedure works in an "idealized" setting (Algorithm 2, Recover
with True Anchor Words), when we are given the exact value of $ARA^T$ and a set of anchor
words - one for each topic. We can permute the rows of $A$ so that the anchor words are exactly
the first $r$ words. Therefore $A^T = (D, U^T)$ where $D$ is a diagonal matrix. Note that $D$ is not
necessarily the identity matrix (nor even a scaled copy of the identity matrix), but we do know
that the diagonal entries are at least $p$. We apply the same permutation to the rows and columns
of $Q$. As shown in Figure 1, if we look at the submatrix formed by the first $r$ rows and $r$ columns,
it is exactly $DRD$. Similarly, the submatrix consisting of the first $r$ rows is exactly $DRA^T$. We
can use these two matrices to compute $R$ and $A$ in this idealized setting (and we will use the same
basic strategy in the general case, but need to be more careful in analyzing how errors
compound in our algorithm).

Our algorithm has exact knowledge of the matrices $DRD$ and $DRA^T$, and so the main task
is to recover the diagonal matrix $D$. Given $D$, we can then compute $A$ and $R$ (for the Dirichlet
Allocation we can also compute its parameters - i.e. the $\alpha$ so that $R(\alpha) = R$). The key idea of this
algorithm is that the row sums of $DR$ and $DRA^T$ are the same, and we can use the row sums of
$DR$ to set up a system of linear constraints on the diagonal entries of $D^{-1}$.

Lemma 3.1. When the matrix $Q$ is exactly equal to $ARA^T$ and we know the set of anchor words,
Recover with True Anchor Words outputs A and R correctly. 

Figure 1: The matrix $Q$, written in block form with $A$ consisting of $D$ stacked on $U$. The blocks of $Q$
are $DRD$ (top left), $DRU^T$ (top right), $URD$ (bottom left) and $URU^T$ (bottom right).

Proof: The Lemma is straightforward from Figure 1 and the procedure. By Figure 1 we can
find the exact value of $DRA^T$ and $DRD$ in the matrix $Q$. Step 2 of Recover computes $DR\mathbf{1}$ by
computing $DRA^T\mathbf{1}$. The two vectors are equal because $A$ is the topic-term matrix and its columns
sum up to 1, in particular $A^T\mathbf{1} = \mathbf{1}$.

In Step 3, since $R$ is invertible by Lemma 2.2 and $D$ is a diagonal matrix with entries at least $p$, the
matrix $DRD$ is also invertible. Therefore there is a unique solution $z = (DRD)^{-1}DR\mathbf{1} = D^{-1}\mathbf{1}$.
Also $Dz = \mathbf{1}$ and hence $D\,\mathrm{Diag}(z) = I$. Finally, using the fact that $D\,\mathrm{Diag}(z) = I$, the output in
Step 4 is just $(DR)^{-1}DRA^T = A^T$, and the output in Step 5 is equal to $R$. ■
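
The idealized recovery procedure translates almost line by line into NumPy. The sketch below is an illustration rather than the authors' implementation: it indexes the anchor rows and columns of Q directly instead of permuting them to the front, and the function name and toy matrices are assumptions (the matrix R used here is the second-moment matrix of a symmetric Dirichlet with parameter 1/2 on two topics).

```python
import numpy as np

def recover_with_anchor_words(Q, anchors):
    """Recover (A, R) from Q = A R A^T given one anchor row index per topic."""
    DRD = Q[np.ix_(anchors, anchors)]     # anchor rows and columns of Q
    DRAT = Q[anchors, :]                  # anchor rows of Q, i.e. D R A^T
    DR1 = DRAT.sum(axis=1)                # D R A^T 1 = D R 1 because A^T 1 = 1
    z = np.linalg.solve(DRD, DR1)         # z = D^{-1} 1
    A_T = np.linalg.solve(DRD @ np.diag(z), DRAT)   # (D R)^{-1} D R A^T = A^T
    R = np.diag(z) @ DRD @ np.diag(z)
    return A_T.T, R

# Toy instance: rows 0 and 1 of A are anchor words, columns of A sum to 1.
A = np.array([[0.4, 0.0],
              [0.0, 0.5],
              [0.3, 0.2],
              [0.3, 0.3]])
R = np.array([[0.375, 0.125],
              [0.125, 0.375]])
Q = A @ R @ A.T
A_hat, R_hat = recover_with_anchor_words(Q, [0, 1])
```

On exact input the outputs agree with A and R up to floating-point error, as Lemma 3.1 states. Note that the procedure never needs the diagonal entries of D as input; they are recovered implicitly through z.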

3.2 Recover R and A with Almost Anchor Words 

What if we are not given the exact anchor words, but are given words that are "close" to anchor 
words? As we noted, in general we cannot hope to recover the true anchor words, but even a good 
approximation will be enough to recover R and A. 

When we restrict A to the rows corresponding to "almost" anchor words, the submatrix will 
not be diagonal. However, it will be close to a diagonal in the sense that the submatrix will be a 
diagonal matrix D multiplied by E, and E is close to the identity matrix (and the diagonal entries 
of $D$ are at least $\Omega(p)$). Here we analyze the same procedure as above and show that it still recovers
A and R (approximately) even when given "almost" anchor words instead of true anchor words. For 
clarity we state the procedure again in Algorithm 3: Recover with Almost Anchor Words. 
The guarantees at each step are different than before, but the implementation of the procedure 
is the same. Notice that here we permute the rows of A (and hence the rows and columns of Q) 
so that the "almost" anchor words returned by Theorem 2.6 appear first and the submatrix A on 
these rows is equal to DE. 

Here, we still assume that the matrix $Q$ is exactly equal to $ARA^T$ and hence the first $r$ rows of
$Q$ form the submatrix $DERA^T$ and the first $r$ rows and columns are $DERE^TD$. The complication
here is that $\mathrm{Diag}(z)$ is not necessarily equal to $D^{-1}$, since the matrix $E$ is not necessarily the
identity. However, we can show that $\mathrm{Diag}(z)$ is "close" to $D^{-1}$ if $E$ is suitably close to the identity
matrix - i.e. given good enough proxies for the anchor words, we can bound the error of the above
recovery procedure. We write $E = I + Z$. Intuitively when $Z$ has only small entries $E$ should
behave like the identity matrix. In particular, $E^{-1}$ should have only small off-diagonal entries. We
make this precise through the following lemmas:

Lemma 3.2. Let $E = I + Z$ and $\sum_{i,j}|Z_{i,j}| = \epsilon < 1/2$, then $E^{-1}\mathbf{1}$ is a vector with entries in the
range $[1 - 2\epsilon, 1 + 2\epsilon]$.

Algorithm 3. Recover with Almost Anchor Words
Input: r "almost" anchor words, Output: R and A 

1. Permute the rows and columns of Q so that the "almost" anchor words appear in the first r rows and 
columns. 

2. Compute $DERA^T\mathbf{1}$ (which is equal to $DER\mathbf{1}$)

3. Solve for $z$: $DERE^TDz = DER\mathbf{1}$.

4. Output $A^T = (DERE^TD\,\mathrm{Diag}(z))^{-1}DERA^T$.

5. Output $R = \mathrm{Diag}(z)\,DERE^TD\,\mathrm{Diag}(z)$.



Proof: $E$ is clearly invertible because the spectral norm of $Z$ is at most 1/2. Let $b = E^{-1}\mathbf{1}$. Since
$E = I + Z$ we multiply $E$ on both sides to get $b + Zb = \mathbf{1}$. Let $b_{\max}$ be the largest absolute
value of any entry of $b$ ($b_{\max} = \max_i |b_i|$). Consider the entry $i$ where $b_{\max}$ is achieved; we know
$b_{\max} = |b_i| \le 1 + |(Zb)_i| \le 1 + \sum_j |Z_{i,j}||b_j| \le 1 + \epsilon b_{\max}$. Thus $b_{\max} \le 1/(1 - \epsilon) \le 2$. Now all the
entries in $Zb$ are within $2\epsilon$ in absolute value, and we know that $b = \mathbf{1} - Zb$. Hence all the entries
of $b$ are in the range $[1 - 2\epsilon, 1 + 2\epsilon]$, as desired. ■

Lemma 3.3. Let $E = I + Z$ and $\sum_{i,j}|Z_{i,j}| = \epsilon < 1/2$, then the columns of $E^{-1} - I$ have $\ell_1$ norm
at most $2\epsilon$.

Proof: Without loss of generality, we can consider just the first column of $E^{-1} - I$, which is
equal to $(E^{-1} - I)e_1$, where $e_1$ is the indicator vector that is one on the first coordinate and zero
elsewhere.

The approach is similar to that in Lemma 3.2. Let $b = (E^{-1} - I)e_1$. Left multiply by $E = (I + Z)$
and we obtain $b + Zb = -Ze_1$. Hence $b = -Z(b + e_1)$. Let $b_{\max}$ be the largest absolute value of
entries of $b$ ($b_{\max} = \max_i |b_i|$). Let $i$ be the entry in which $b_{\max}$ is achieved. Then

$$b_{\max} = |b_i| \le |(Zb)_i| + |(Ze_1)_i| \le \epsilon b_{\max} + \epsilon.$$

Therefore $b_{\max} \le \epsilon/(1 - \epsilon) \le 2\epsilon$. Further, $\|b\|_1 \le \|Ze_1\|_1 + \|Zb\|_1 \le \epsilon + 2\epsilon^2 \le 2\epsilon$. ■
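
Both lemmas are easy to check numerically on a random perturbation. The sketch below is only a sanity check under assumed values: the dimension, the value of $\epsilon$, and the random seed are arbitrary choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
r, eps = 5, 0.3
Z = rng.standard_normal((r, r))
Z *= eps / np.abs(Z).sum()                # now sum_{i,j} |Z_ij| = eps < 1/2
E = np.eye(r) + Z
E_inv = np.linalg.inv(E)

# Lemma 3.2: every entry of E^{-1} 1 lies in [1 - 2 eps, 1 + 2 eps].
b = E_inv @ np.ones(r)
# Lemma 3.3: every column of E^{-1} - I has l1 norm at most 2 eps.
col_norms = np.abs(E_inv - np.eye(r)).sum(axis=0)
```

The quantities `np.abs(b - 1).max()` and `col_norms.max()` can then be compared against `2 * eps`.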

Now we are ready to show that the procedure Recover with Almost Anchor Words 
succeeds when given "almost" anchor words: 

Lemma 3.4. When the matrix $Q$ is exactly equal to $ARA^T$ and the matrix $A$ restricted to almost
anchor words is $DE$, where $E - I$ has $\ell_1$ norm $\epsilon < 1/10$ when viewed as a vector, procedure
Recover with Almost Anchor Words outputs $A$ such that each column of $A$ has $\ell_1$ error at
most $6\epsilon$. The matrix $R$ has additive error $Z_R$ whose $\ell_1$ norm when viewed as a vector is at most
$8\epsilon$.

Proof: Since $Q$ is exactly $ARA^T$, our algorithm is given $DERA^T$ and $DERE^TD$ with no error.
In Step 3, since $D$, $E$ and $R$ are all invertible, we have

$$z = (DERE^TD)^{-1}DER\mathbf{1} = D^{-1}(E^T)^{-1}\mathbf{1}.$$

Ideally we would want $\mathrm{Diag}(z) = D^{-1}$, and indeed $D\,\mathrm{Diag}(z) = \mathrm{Diag}((E^T)^{-1}\mathbf{1})$. From Lemma 3.2,
the vector $(E^T)^{-1}\mathbf{1}$ has entries in the range $[1 - 2\epsilon, 1 + 2\epsilon]$, thus each entry of $\mathrm{Diag}(z)$ is within a
$(1 \pm 2\epsilon)$ multiplicative factor from the corresponding entry in $D^{-1}$.

Consider the output in Step 4. Since $D$, $E$, $R$ are invertible, the first output is

$$(DERE^TD\,\mathrm{Diag}(z))^{-1}DERA^T = (D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1}A^T.$$

Our goal is to bound the $\ell_1$ error of the columns of the output compared to the corresponding
columns of $A$. Notice that it is sufficient to show that the $j$th row of $(D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1}$ is close
(in $\ell_1$ distance) to the indicator vector $e_j^T$.

Claim 3.5. For each $j$, $\|e_j^T(D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1} - e_j^T\|_1 \le 5\epsilon$.

Proof: Again, without loss of generality we can consider just the first row. From Lemma 3.3
$e_1^T(E^T)^{-1}$ has $\ell_1$ distance at most $2\epsilon$ to $e_1^T$. $(D\,\mathrm{Diag}(z))^{-1}$ has entries in the range $[1 - 3\epsilon, 1 + 3\epsilon]$.
And so

$$\|e_1^T(D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1} - e_1^T\|_1 \le \|e_1^T(D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1} - e_1^T(E^T)^{-1}\|_1 + \|e_1^T(E^T)^{-1} - e_1^T\|_1.$$

The last term can be bounded by $2\epsilon$. Consider the first term on the right hand side: the vector
$e_1^T(D\,\mathrm{Diag}(z))^{-1} - e_1^T$ has one non-zero entry (the first one) whose absolute value is at most $3\epsilon$.
Hence, from Lemma 3.3 the first term can be bounded by $6\epsilon^2 \le 3\epsilon$, and this implies the claim. ■

The first row of $(D\,\mathrm{Diag}(z))^{-1}(E^T)^{-1}A^T$ is $A_1 + z^TA$ where $z$ is a vector with $\ell_1$ norm at most
$5\epsilon$. So every column of $A$ is recovered with $\ell_1$ error at most $6\epsilon$.

Consider the second output of the algorithm. The output is $\mathrm{Diag}(z)DERE^TD\,\mathrm{Diag}(z)$ and we
can write $\mathrm{Diag}(z)D = I + Z_1$ and $E = I + Z_2$. The leading error terms are $Z_1R + Z_2R + RZ_1 + RZ_2$,
and hence the $\ell_1$ norm of the leading error term (when treated as a vector) is at most $6\epsilon$; the other
terms are of order $\epsilon^2$ and can safely be bounded by $2\epsilon$ for suitably small $\epsilon$. ■

Finally we consider the general case (in which there is additive noise in Step 1): we are not
given $ARA^T$ exactly. We are given $Q$ which is close to $ARA^T$ (by Lemma 3.7). We will bound the
accumulation of this last type of error. Suppose in Step 1 of RECOVER we obtain $DERA^T + U$
and $DERE^TD + V$ and furthermore the entries of $U$ and $U\vec{1}$ have absolute value at most $\epsilon_1$ and
the matrix $V$ has $\ell_1$ norm $\epsilon_2$ when viewed as a vector.

Lemma 3.6. If $\epsilon, \epsilon_1, \epsilon_2$ are sufficiently small, RECOVER outputs $A$ such that each entry of $A$ has
additive error at most $O(\epsilon + (ra\epsilon_2/p^3 + \epsilon_1 r/p^2)/\gamma)$. Also the matrix $R$ has additive error $Z_R$ whose
$\ell_1$ norm when viewed as a vector is at most $O(\epsilon + (ra\epsilon_2/p^3 + \epsilon_1 r/p^2)/\gamma)$.

The main idea of the proof is to write $DERE^TD + V$ as $DER(E^T + V')D$. In this way the
error $V$ can be translated to an error $V'$ on $E$ and Lemma 3.4 can be applied. The error $U$ can be
handled similarly.

Proof: We shall follow the proof of Lemma 3.4. First we can express the error term $V$ instead as
$V = (DER)V'(D)$. This is always possible because all of $D$, $E$, $R$ are invertible. Moreover, the $\ell_1$
norm of $V'$ when viewed as a vector is at most $8ra\epsilon_2/(\gamma p^3)$, because this norm will grow by a factor
of at most $1/p$ when multiplied by $D^{-1}$, a factor of at most 2 when multiplied by $E^{-1}$ and at most
$ra/\Gamma(R)$ when multiplied by $R^{-1}$. The bound on $\Gamma(R)$ comes from Lemma 2.2; we lose an extra $ra$
because the rows of $R$ may not sum up to 1.

Hence $DERE^TD + V = DER(E^T + V')D$ and the additive error for $DERE^TD$ can be
transformed into error in $E$, and we will be able to apply the analysis in Lemma 3.4.

Similarly, we can express the error term $U$ as $U = DERU'$. Entries of $U'$ have absolute value
at most $8\epsilon_1 r/(\gamma p^2)$. The right hand side of the equation in Step 3 is equal to $DER\vec{1} + U\vec{1}$, so the
error is at most $\epsilon_1$ per entry. Following the proof of Lemma 3.4, we know $\mathrm{Diag}(z)D$ has diagonal
entries within $1 \pm (2\epsilon + 16\epsilon_2/(\gamma p^3) + 2\epsilon_1)$.

Now we consider the output. The output for $A^T$ is equal to

$$(DER(E^T + V')D\,\mathrm{Diag}(z))^{-1}DER(A^T + U') = (D\,\mathrm{Diag}(z))^{-1}(E^T + V')^{-1}(A^T + U').$$

Here we know $(E^T + V')^{-1} - I$ has $\ell_1$ norm at most $O(\epsilon + ra\epsilon_2/(\gamma p^3))$ per row, $D\,\mathrm{Diag}(z)$ is a
diagonal matrix with entries in $1 \pm O(\epsilon + ra\epsilon_2/(\gamma p^3) + \epsilon_1)$, and entries of $U'$ have absolute value $O(\epsilon_1 r/(\gamma p^2))$.
Following the proof of Lemma 3.4 the final entry-wise error of $A$ is roughly the sum of these three
errors, and is bounded by $O(\epsilon + (ra\epsilon_2/p^3 + \epsilon_1 r/p^2)/\gamma)$. (Notice that Lemma 3.4 gives a bound for the $\ell_1$
norm of rows, which is stronger. Here we switched to entry-wise error because the entries of $U$ are
bounded while the $\ell_1$ norm of $U$ might be large.)

Similarly, the output of $R$ is equal to $\mathrm{Diag}(z)(DERE^TD + V)\mathrm{Diag}(z)$. Again we write $\mathrm{Diag}(z)D =
I + Z_1$ and $E = I + Z_2$. The extra term $\mathrm{Diag}(z)V\mathrm{Diag}(z)$ is small because the entries of $z$ are
at most $2/p$ (otherwise $\mathrm{Diag}(z)D$ won't be close to identity). The error can be bounded by
$O(\epsilon + (ra\epsilon_2/p^3 + \epsilon_1 r/p^2)/\gamma)$. ■

Now in order to prove our main theorem we just need to show that when the number of documents
is large enough, the matrix $Q$ is close to $ARA^T$, and plug the error bounds into Lemma 3.6.

3.3 Error Bounds for Q 

Here we show that the matrix $Q$ indeed converges to $\frac{1}{m}AWW^TA^T = ARA^T$ when $m$ is large
enough.

Lemma 3.7. When $m > \frac{50 \log n}{N\epsilon_Q^2}$, with high probability all entries of $Q - \frac{1}{m}AWW^TA^T$ have absolute
value at most $\epsilon_Q$. Further, the $\ell_1$ norms of the rows of $Q$ are also $\epsilon_Q$ close to the $\ell_1$ norms of the
corresponding rows of $\frac{1}{m}AWW^TA^T$.

Proof: We shall first show that the expectation of $Q$ is equal to $ARA^T$ where $R$ is $\frac{1}{m}WW^T$. Then
by concentration bounds we show that entries of $Q$ are close to their expectations. Notice that we
can also hope to show that $Q$ converges to $AR(\tau)A^T$. However in that case we will not be able to
get the inverse polynomial relationship with $N$ (indeed, even if $N$ goes to infinity it is impossible
to learn $R(\tau)$ with only one document). Replacing $R(\tau)$ with the empirical $R$ allows our algorithm
to perform better when the number of words per document is larger.

To show the expectation is correct we observe that conditioned on $W$, the entries of the two matrices
$M$ and $M'$ are independent. Their expectations are both $\frac{N}{2}AW$. Therefore,

$$\mathbb{E}[Q] = \frac{4}{mN^2}\mathbb{E}[MM'^T] = \frac{1}{m}\left(\frac{2}{N}\mathbb{E}[M]\right)\left(\frac{2}{N}\mathbb{E}[M'^T]\right) = \frac{1}{m}AWW^TA^T = ARA^T.$$

We still need to show that $Q$ is close to this expectation. This is not surprising because $Q$ is the
average of $m$ independent samples (of $\frac{4}{N^2}M_iM_i'^T$). Further, the variance of each entry in $\frac{4}{N^2}M_iM_i'^T$
can be bounded because $M$ and $M'$ also come from independent samples. For any $i, j_1, j_2$, let $v =
AW_i$ be the probability distribution that $M_i$ and $M_i'$ are sampled from; then $M_i(j_1)$ is distributed
as $\mathrm{Binomial}(N/2, v(j_1))$ and $M_i'(j_2)$ is distributed as $\mathrm{Binomial}(N/2, v(j_2))$. The variances of these
two variables are less than $N/8$ no matter what $v$ is, by the properties of the binomial distribution.
Conditioned on the vector $v$ these two variables are independent, thus the variance of their product
is at most

$$\mathrm{Var}\,M_i(j_1)\,\mathbb{E}[M_i'(j_2)]^2 + \mathbb{E}[M_i(j_1)]^2\,\mathrm{Var}\,M_i'(j_2) + \mathrm{Var}\,M_i(j_1)\,\mathrm{Var}\,M_i'(j_2) \le N^3/4 + N^2/64.$$

The variance of any entry in $\frac{4}{N^2}M_iM_i'^T$ is at most $4/N + 1/(16N^2) = O(1/N)$. Higher moments can
be bounded similarly and they satisfy the assumptions of Bernstein inequalities. Thus by Bernstein
inequalities the probability that any entry is more than $\epsilon_Q$ away from its true value is much smaller
than $1/n^2$.
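
The variance bound for the product of two independent binomial counts can be checked exactly. The sketch below is a minimal illustration: it uses the standard identity for the variance of a product of independent variables and scans a grid of success probabilities; the function name, the value $N = 40$ and the grid are assumptions made for the example.

```python
def product_variance(n_half, p1, p2):
    """Exact Var(X * Y) for independent X ~ Bin(n_half, p1) and Y ~ Bin(n_half, p2)."""
    mean_x, var_x = n_half * p1, n_half * p1 * (1.0 - p1)
    mean_y, var_y = n_half * p2, n_half * p2 * (1.0 - p2)
    # Var(XY) = Var(X) E[Y]^2 + E[X]^2 Var(Y) + Var(X) Var(Y) when X and Y are independent.
    return var_x * mean_y ** 2 + mean_x ** 2 * var_y + var_x * var_y


N = 40  # words per document; each half-document has N / 2 words
grid = [k / 20 for k in range(21)]
worst = max(product_variance(N // 2, p1, p2) for p1 in grid for p2 in grid)
bound = N ** 3 / 4 + N ** 2 / 64
print(worst <= bound)
```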

The further part follows from the observation that the $\ell_1$ norm of a row in $Q$ is proportional
to the number of appearances of the word. As long as the number of appearances concentrates, the
error in $\ell_1$ norm must be small. The words are all independent (conditioned on $W$) so this is just
a direct application of Chernoff bounds. ■



3.4 Proving the Main Theorem 

We are now ready to prove Theorem 1.4: 

Proof: By Lemma 3.7 we know when we have at least $50 \log n/(N\epsilon_Q^2)$ documents, $Q$ is entry-wise
close to $ARA^T$. In this case the error per row for Theorem 2.6 is at most $\epsilon_Q \cdot O(a^2r^2/p^2)$ because in
this step we can assume that there are at most $4ar/p$ words (see Section 3.5) and to normalize the
row we need a multiplicative factor of at most $10ar/p$ (we shall only consider rows with $\ell_1$ norm
at least $p/10ar$; with high probability all the anchor words are in these rows). The $\gamma$ parameter
for Theorem 2.7 is $p\gamma/4$ by Lemma 2.2. Thus the almost anchor words found by the algorithm have
weight at least $1 - O(\epsilon_Q a^2r^2/(\gamma p^3))$ on diagonals. The error for $DERE^TD$ is at most $\epsilon_Q r^2$, and the error
for any entry of $DERA^T$ and $DERA^T\vec{1}$ is at most $O(\epsilon_Q)$. Therefore by Lemma 3.6 the entry-wise
error for $A$ is at most $O(\epsilon_Q a^2r^3/(\gamma p^3))$.

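When $\epsilon_Q < \epsilon p^3\gamma/(a^2r^3)$ the error is bounded by $\epsilon$. In this case we need

$$m = \max\left\{ O\left(\frac{\log n \cdot a^4r^6}{\epsilon^2p^6\gamma^2N}\right), O\left(\frac{\log r \cdot a^2r^4}{\gamma^2}\right) \right\}.$$

The latter constraint comes from Lemma 2.2.

To get within $\epsilon$ additive error for the parameter $\alpha$, we further need $R$ to be close enough to the
variance-covariance matrix of the document-topic distribution, which means $m$ is at least

$$m = \max\left\{ O\left(\frac{\log n \cdot a^4r^6}{\epsilon^2p^6\gamma^2N}\right), O\left(\frac{\log r \cdot a^2r^4}{\gamma^2}\right), O\left(\frac{\log r \cdot r^2}{\epsilon^2}\right) \right\}.$$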



3.5 Reducing Dictionary Size 

Above we assumed that the number of distinct words is small. Here, we give a simple gadget that
shows in the general case we can assume that this is the case at the loss of an additional additive
$\epsilon$ in our accuracy:

Lemma 3.8. The general case can be reduced to an instance in which there are at most $4ar/\epsilon$
words all of which (with at most one exception) occur with probability at least $\epsilon/4ar$.

Proof: In fact, we can collect all words that occur infrequently and "merge" all of these words into
an aggregate word that we will call the runoff word. To this end, we call a word large if it appears
more than emN/3ar times in m = 100c ^° g - documents, and otherwise we call it small. Indeed, 
with high probability all large words are words that occur with probability at least $\epsilon/4ar$ in our
model. Also, every word that has an entry larger than $\epsilon$ in the corresponding row of $A$ will appear
with probability at least $\epsilon/ar$, and is thus a large word with high probability. We can merge all
small words (i.e. rename all of these words to a single, new word). Hence we can apply the above
algorithm (which assumed that there are not too many distinct words). After we get a result with
the modified documents we can ignore the runoff word and assign weight 0 to all the small words.
The result will still be correct up to $\epsilon$ additive error. ■
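
The merging gadget can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure analyzed above: the count threshold is passed in as a parameter (in the proof it is the cutoff that separates large from small words), and the function name and the runoff token `<RUNOFF>` are hypothetical.

```python
from collections import Counter


def merge_small_words(documents, threshold, runoff="<RUNOFF>"):
    """Rename every word that appears at most `threshold` times to a single runoff word."""
    counts = Counter(word for doc in documents for word in doc)
    large = {word for word, count in counts.items() if count > threshold}
    merged = [[word if word in large else runoff for word in doc] for doc in documents]
    return merged, large


docs = [["cat", "dog", "cat"], ["dog", "emu"], ["cat", "dog", "yak"]]
merged, large = merge_small_words(docs, threshold=1)
print(merged)
print(sorted(large))
```

After the learning algorithm is run on the merged documents, the row of the runoff word is discarded and every small word receives weight 0, as described in the proof.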



4 The Dirichlet Subcase 

Here we demonstrate that the parameters of a Dirichlet distribution can be (robustly) recovered
from just the covariance matrix $R(\tau)$. Hence an immediate corollary is that our main learning
algorithm can recover both the topic matrix $A$ and the distribution that generates columns of $W$ in
a Latent Dirichlet Allocation (LDA) Model [6], provided that $A$ is separable. We believe that this
algorithm may be of practical use, and provides the first alternative to local search and (unproven)
approximation procedures for this inference problem [32], [11], [6].

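The Dirichlet distribution, parametrized by a vector $\alpha$ of positive reals, is a natural family of
continuous multivariate probability distributions. The support of the Dirichlet distribution is the
unit simplex whose dimension is the same as the dimension of $\alpha$. Let $\alpha$ be an $r$-dimensional vector.
Then for a vector $\theta \in \mathbb{R}^r$ in the $r$-dimensional simplex, its probability density is given by

$$\Pr[\theta \mid \alpha] = \frac{\Gamma\left(\sum_{i=1}^r \alpha_i\right)}{\prod_{i=1}^r \Gamma(\alpha_i)} \prod_{i=1}^r \theta_i^{\alpha_i - 1},$$

where $\Gamma$ is the Gamma function. In particular, when all the $\alpha_i$'s are equal to one, the Dirichlet
distribution is just the uniform distribution over the probability simplex.

The expectation and variance of the $\theta_i$'s are easy to compute given the parameters $\alpha$. We denote
$\alpha_0 = \|\alpha\|_1 = \sum_{i=1}^r \alpha_i$; then the ratio $\alpha_i/\alpha_0$ should be interpreted as the "size" of the $i$-th variable
$\theta_i$, and $\alpha_0$ shows whether the distribution is concentrated in the interior (when $\alpha_0$ is large) or
near the boundary (when $\alpha_0$ is small). The first two moments of the Dirichlet distribution are listed
below:

$$\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0},$$

$$\mathbb{E}[\theta_i\theta_j] = \begin{cases} \dfrac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)} & \text{when } i \ne j, \\[2mm] \dfrac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)} & \text{when } i = j. \end{cases}$$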



Suppose the Dirichlet distribution has $\max_i \alpha_i/\min_i \alpha_i = a$ and the sum of parameters is $\alpha_0$; we
give an algorithm that computes close estimates to the vector of parameters $\alpha$ given a sufficiently
close estimate to the covariance matrix $R(\tau)$ (Theorem 4.3). Combining this with Theorem 1.4,
we obtain the following corollary:

Theorem 4.1. There is an algorithm that learns the topic matrix $A$ with high probability up to an
additive error of $\epsilon$ from at most

$$m = \max\left\{ O\left(\frac{\log n \cdot a^6r^8(\alpha_0+1)^4}{\epsilon^2p^6N}\right), O\left(\frac{\log r \cdot a^2r^4(\alpha_0+1)^2}{\epsilon^2}\right) \right\}$$

documents sampled from the LDA model and runs in time polynomial in $n$, $m$. Furthermore, we
also recover the parameters of the Dirichlet distribution to within an additive $\epsilon$.

Our main goal in this section is to bound the $\ell_1$-condition number of the Dirichlet distribution
(Section 4.1), and using this we show how to recover the parameters of the distribution from its
covariance matrix (Section 4.2).

Algorithm 4. Dirichlet($R$), Input: $R$, Output: $\alpha$ (vector of parameters)

1. Set $\alpha/\alpha_0 = R\vec{1}$.

2. Let $i$ be the row with smallest $\ell_1$ norm, let $u = R_{i,i}$ and $v = \alpha_i/\alpha_0$.

3. Set $\alpha_0 = \dfrac{1 - u/v}{u/v - v}$.

4. Output $\alpha = \alpha_0 \cdot (\alpha/\alpha_0)$.
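
The four steps of Algorithm 4 translate directly into code. The following minimal Python sketch assumes that $R$ is given exactly as the second-moment matrix of the Dirichlet distribution (entries $\alpha_i\alpha_j/(\alpha_0(\alpha_0+1))$ off the diagonal and $\alpha_i(\alpha_i+1)/(\alpha_0(\alpha_0+1))$ on it); the function names and the test vector are illustrative assumptions. Step 3 works because $u/v = (\alpha_i+1)/(\alpha_0+1)$, so $1 - u/v = (\alpha_0 - \alpha_i)/(\alpha_0+1)$ and $u/v - v = (\alpha_0 - \alpha_i)/(\alpha_0(\alpha_0+1))$.

```python
def dirichlet_second_moments(alpha):
    """Build R(alpha): the matrix of second moments E[theta_i * theta_j]."""
    a0 = sum(alpha)
    denom = a0 * (a0 + 1.0)
    r = len(alpha)
    return [[alpha[i] * (alpha[j] + (1.0 if i == j else 0.0)) / denom
             for j in range(r)] for i in range(r)]


def recover_dirichlet(R):
    """Recover the parameter vector alpha from R, following the four steps of Algorithm 4."""
    ratios = [sum(row) for row in R]  # Step 1: alpha / alpha_0 = R * 1
    i = min(range(len(R)), key=lambda k: sum(abs(x) for x in R[k]))  # Step 2
    u, v = R[i][i], ratios[i]
    a0 = (1.0 - u / v) / (u / v - v)  # Step 3
    return [a0 * q for q in ratios]  # Step 4


true_alpha = [0.5, 1.0, 2.5]
estimate = recover_dirichlet(dirichlet_second_moments(true_alpha))
print(estimate)
```

With an exact $R$ the parameters are recovered up to floating-point error; Theorem 4.3 quantifies what happens when $R$ is only known approximately.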



4.1 Condition Number of a Dirichlet Distribution 

There is a well-known meta-principle that if a matrix $W$ is chosen by picking its columns
independently from a fairly diffuse distribution, then it will be far from low rank. However, our analysis
will require us to prove an explicit lower bound on $\Gamma(R(\tau))$. We now prove such a bound when the
columns of $W$ are chosen from a Dirichlet distribution with parameter vector $\alpha$. We note that it is
easy to establish such bounds for other types of distributions as well. Recall that we defined $R(\tau)$
in Section 1; here we will abuse notation and throughout this section we will denote by $R(\alpha)$
the matrix $R(\tau)$ where $\tau$ is a Dirichlet distribution with parameter $\alpha$.

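Let $\alpha_0 = \sum_{i=1}^r \alpha_i$. The mean, variance and covariance for a Dirichlet distribution are
well-known, from which we observe that $R(\alpha)_{i,j}$ is equal to $\frac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)}$ when $i \ne j$ and is equal to $\frac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}$
when $i = j$.

Lemma 4.2. The $\ell_1$ condition number of $R(\alpha)$ is at least $\frac{1}{2(\alpha_0+1)}$.

Proof: As the entry $R(\alpha)_{i,j}$ is $\frac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)}$ when $i \ne j$ and $\frac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}$ when $i = j$, after normalization
$R(\alpha)$ is just the matrix $D' = \frac{1}{\alpha_0+1}(\alpha \times (1, 1, \ldots, 1) + I)$ where $\times$ is the outer product and $I$ is the
identity matrix.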

Let $x$ be a vector such that $|x|_1 = 1$ and $|D'x|_1$ achieves the minimum in $\Gamma(R(\alpha))$, and let
$I = \{i \mid x_i > 0\}$ and let $J = \bar{I}$ be the complement. We can assume without loss of generality that
$\sum_{i \in I} x_i \ge |\sum_{i \in J} x_i|$ (otherwise just take $-x$ instead). The product $D'x$ is $\frac{\sum_i x_i}{\alpha_0+1}\alpha + \frac{1}{\alpha_0+1}x$. The
first term is a nonnegative vector and hence for each $i \in I$, $(D'x)_i \ge 0$. This implies that

$$|D'x|_1 \ge \frac{1}{\alpha_0+1}\sum_{i \in I} x_i \ge \frac{1}{2} \cdot \frac{1}{\alpha_0+1}.$$
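
The lower bound of Lemma 4.2 can be probed numerically. The sketch below is a minimal illustration that samples random vectors $x$ with $|x|_1 = 1$ and evaluates $|D'x|_1$ for $D' = \frac{1}{\alpha_0+1}(\alpha \times (1, \ldots, 1) + I)$; random sampling only gives an upper estimate of the true minimum, and the function names, the parameter vector and the number of trials are assumptions made for the example.

```python
import random


def l1_norm_of_image(alpha, x):
    """Return |D'x|_1 where D'x = ((sum of x) * alpha + x) / (alpha_0 + 1)."""
    a0 = sum(alpha)
    total = sum(x)
    return sum(abs(total * a + xi) for a, xi in zip(alpha, x)) / (a0 + 1.0)


def smallest_sampled_image(alpha, trials=2000, seed=0):
    """Smallest |D'x|_1 seen over random x with unit l1 norm."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        raw = [rng.uniform(-1.0, 1.0) for _ in alpha]
        scale = sum(abs(v) for v in raw)
        x = [v / scale for v in raw]
        best = min(best, l1_norm_of_image(alpha, x))
    return best


alpha = [0.3, 1.0, 2.2]
lower_bound = 1.0 / (2.0 * (sum(alpha) + 1.0))
sampled = smallest_sampled_image(alpha)
print(sampled >= lower_bound)
```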



4.2 Recovering the Parameters of a Dirichlet Distribution 

When the variance-covariance matrix $R(\alpha)$ is recovered with error $\epsilon_R$ in $\ell_1$ norm when viewed as
a vector, we can use Algorithm 4: Dirichlet to compute the parameters for the Dirichlet.

Theorem 4.3. When the variance-covariance matrix $R(\alpha)$ is recovered with error $\epsilon_R$ in $\ell_1$ norm
when viewed as a vector, the procedure Dirichlet($R$) learns the parameters of the Dirichlet
distribution with error at most $O(ar(\alpha_0+1)\epsilon_R)$.

Proof: The $\alpha_i/\alpha_0$'s all have error at most $\epsilon_R$. The value $u$ is $\frac{\alpha_i}{\alpha_0} \cdot \frac{\alpha_i+1}{\alpha_0+1} \pm \epsilon_R$ and the value $v$ is
$\alpha_i/\alpha_0 \pm \epsilon_R$. Since $v \ge 1/ar$ we know the error for $u/v$ is at most $2ar\epsilon_R$. Finally we need to bound
the denominator $\frac{\alpha_i+1}{\alpha_0+1} - \frac{\alpha_i}{\alpha_0} \ge \frac{1}{2(\alpha_0+1)}$ (since $\frac{\alpha_i}{\alpha_0} \le 1/r \le 1/2$). Thus the final error is at most
$5ar(\alpha_0+1)\epsilon_R$. ■

5 Obtaining Almost Anchor Words



In this section, we prove Theorem 2.7, which we restate here: 

Theorem 5.1. Suppose $M = AW$ where $W$ and $M$ are normalized to have rows sum up to 1, $A$
is separable and $W$ is $\gamma$-robustly simplicial. When $\epsilon < \gamma/100$ there is a polynomial time algorithm
that given $M'$ such that for all rows $\|M'^i - M^i\|_1 < \epsilon$, finds $r$ rows (almost anchor words) in $M'$. The
$i$-th almost anchor word corresponds to a row in $M$ that can be represented as $(1 - O(\epsilon/\gamma))W^i +
O(\epsilon/\gamma)W^{-i}$. Here $W^{-i}$ is a vector in the convex hull of other rows in $W$ with unit length in $\ell_1$
norm.

The major weakness of the algorithm in [2] is that it only considers the $\ell_1$ norm. However in an r
dimensional simplex there is another natural measure of distance more suited to our purposes: since 
each point is a unique convex combination of the vertices, we can view this convex combination as 
a probability distribution and use statistical distance (on this representation) as a norm for points 
inside the simplex. We will in fact need a slight modification to this norm, since we would like it 
to extend to points outside the simplex too: 

Definition 5.2 ($(\delta, \epsilon)$-close). A point $M'^j$ is $(\delta, \epsilon)$-close to $M'^i$ if and only if

$$\min_{c_k \ge 0,\ \sum_{k=1}^n c_k = 1,\ c_j \ge 1 - \delta} \Big\| M'^i - \sum_{k=1}^n c_k M'^k \Big\|_1 \le \epsilon.$$

Intuitively, think of point $M'^j$ as $(\delta, \epsilon)$-close to $M'^i$ if $M'^i$ is $\epsilon$ close in $\ell_1$ distance to some point
$Q$, where $Q$ is a convex combination of the rows of $M'$ that places at least $1 - \delta$ weight on $M'^j$.
Notice that this definition is not a distance since it is not symmetric, but we will abuse notation
and nevertheless call it a distance function. We remark that this distance is easy to compute: to
check whether $M'^j$ is $(\delta, \epsilon)$-close to $M'^i$ we just need to solve a linear program that minimizes the
$\ell_1$ distance when the $c$ vector is a probability distribution with at least $1 - \delta$ weight on $j$ (the
constraints on $c$ are clearly all linear).

We also consider all points that a row $M'^j$ is close to; this is called the neighborhood of $M'^j$.

Definition 5.3 ($(\delta, \epsilon)$-neighborhood). The $(\delta, \epsilon)$-neighborhood of $M'^j$ are the rows $M'^i$ such that
$M'^j$ is $(\delta, \epsilon)$-close to $M'^i$.

For each point $M'^j$, we know its original (unperturbed) point $M^j$ is a convex combination
of the $W^i$'s: $M^j = \sum_{i=1}^r A_{j,i}W^i$. Separability implies that for any column index $i$ there is a row
$f(i)$ in $A$ whose only nonzero entry is in the $i$-th column. Then $M^{f(i)} = W^i$ and consequently
$\|M'^{f(i)} - W^i\|_1 \le \epsilon$. Let us call these rows $M'^{f(i)}$ for all $i$ the canonical rows. From the above
description the following claim is clear.

Claim 5.4. Every row $M'^j$ has $\ell_1$-distance at most $2\epsilon$ to the convex hull of canonical rows.

Proof: We have:

$$\Big\|M'^j - \sum_{k=1}^r A_{j,k}M'^{f(k)}\Big\|_1 \le \|M'^j - M^j\|_1 + \Big\|M^j - \sum_{k=1}^r A_{j,k}M^{f(k)}\Big\|_1 + \Big\|\sum_{k=1}^r A_{j,k}\big(M^{f(k)} - M'^{f(k)}\big)\Big\|_1$$

and we can bound the right hand side by $2\epsilon$. ■

The algorithm will distinguish rows that are close to vertices from rows that are far by testing
whether each row is close (in $\ell_1$ norm) to the convex hull of rows outside its neighborhood. In
particular, we define a robust loner as:

Definition 5.5 (robust loner). We call a row $M'^j$ a robust loner if it has $\ell_1$ distance more than $2\epsilon$
to the convex hull of rows that are outside its $(6\epsilon/\gamma, 2\epsilon)$ neighborhood.

Our goal is to show that a row is a robust loner if and only if it is close to some row in $W$. The
following lemma establishes one direction:

Lemma 5.6. If $A_{j,t}$ is smaller than $1 - 10\epsilon/\gamma$, the point $M'^j$ cannot be $(6\epsilon/\gamma, 2\epsilon)$-close to the
canonical row that corresponds to $W^t$.

Proof: Assume towards contradiction that $M'^j$ is $(6\epsilon/\gamma, 2\epsilon)$-close to the canonical row $M'^i$ which
is a perturbation of $W^t$. By definition there must be a probability distribution $c \in \mathbb{R}^n$ over the rows
such that $c_j \ge 1 - 6\epsilon/\gamma$, and $\|M'^i - \sum_{k=1}^n c_kM'^k\|_1 \le 2\epsilon$. Now we instead consider the unperturbed
matrix $M$; since every row of $M'$ is $\epsilon$ close (in $\ell_1$ norm) to $M$ we know $\|M^i - \sum_{k=1}^n c_kM^k\|_1 \le 4\epsilon$.
Now we represent $M^i$ and $\sum_{k=1}^n c_kM^k$ as convex combinations of rows of $W$ and consider the
coefficient on $W^t$. Clearly $M^i = W^t$ so the coefficient is 1. But for $\sum_{k=1}^n c_kM^k$, since $c_j \ge 1 - 6\epsilon/\gamma$
and the coefficient $A_{j,t} < 1 - 10\epsilon/\gamma$, we know the coefficient of $W^t$ in the sum must be strictly
smaller than $1 - 10\epsilon/\gamma + 6\epsilon/\gamma = 1 - 4\epsilon/\gamma$. By the robustly simplicial assumption $M^i$ and $\sum_{k=1}^n c_kM^k$
must be more than $4\epsilon/\gamma \cdot \gamma = 4\epsilon$ apart in $\ell_1$ norm, which contradicts our assumption. ■

As a corollary: 

Corollary 5.7. If $A_{j,t}$ is smaller than $1 - 10\epsilon/\gamma$ for all $t$, the row $M'^j$ cannot be a robust loner.

Proof: By the above lemma, we know the canonical rows are not in the $(6\epsilon/\gamma, 2\epsilon)$ neighborhood
of $M'^j$. Thus by Claim 5.4 the row is close to the convex hull of canonical rows and cannot be a
robust loner. ■

Next we prove the other direction: a canonical row is necessarily a robust loner:

Lemma 5.8. All canonical rows are robust loners.

Proof: Suppose $M'^i$ is a canonical row that corresponds to $W^t$. We first observe that all the rows $M'^j$
that are outside the $(6\epsilon/\gamma, 2\epsilon)$ neighborhood of $M'^i$ must have $A_{j,t} < 1 - 6\epsilon/\gamma$. This is because
when $A_{j,t} \ge 1 - 6\epsilon/\gamma$ we have $M^j - \sum_{k=1}^r A_{j,k}W^k = 0$. If we replace $M^j$ by $M'^j$ and $W^k$ by the
corresponding canonical row, the distance is still at most $2\epsilon$ and the coefficient on $M'^i$ is at least
$1 - 6\epsilon/\gamma$. By definition the corresponding row $M'^j$ must be in the neighborhood of $M'^i$.

Now we try to represent $M'^i$ as a convex combination of rows that have $A_{j,t} < 1 - 6\epsilon/\gamma$. However
this is impossible because every point in the convex combination will also have weight smaller than
$1 - 6\epsilon/\gamma$ on $W^t$, while $M^i$ has weight 1 on $W^t$. The $\ell_1$ distance between $M^i$ and the convex
combination of the $M^j$'s where $A_{j,t} < 1 - 6\epsilon/\gamma$ is at least $6\epsilon$ by the robustly simplicial property. Even
when the points are perturbed by $\epsilon$ (in $\ell_1$) the distance can change by at most $2\epsilon$ and is still more
than $2\epsilon$. Therefore $M'^i$ is a robust loner. ■

Now we can prove the main theorem of this section: 

Proof: Suppose we know $\gamma$ and $100\epsilon < \gamma$. When $\epsilon$ is this small we have the following claim:

Claim 5.9. If $A_{j,t}$ and $A_{i,l}$ are at least $1 - 10\epsilon/\gamma$, and $t \ne l$, then $M'^j$ cannot be $(10\epsilon/\gamma, 2\epsilon)$-close
to $M'^i$ and vice versa.

The proof is almost identical to Lemma 5.6. Also, the canonical row that corresponds to $W^t$ is
$(10\epsilon/\gamma, 2\epsilon)$ close to all rows with $A_{j,t} \ge 1 - 10\epsilon/\gamma$. Thus if we connect two robust loners when one
is $(10\epsilon/\gamma, 2\epsilon)$ close to the other, the connected components of the graph will exactly be a partition
according to the row in $W$ that the robust loner is close to. We pick one robust loner in each
connected component to get the almost anchor words.
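
The grouping step can be sketched with a union-find structure. The sketch below is a minimal illustration that assumes the robust loners have already been identified and that a closeness test is available as a black-box predicate `is_close(a, b)` (in the setting above this would be the linear program for $(10\epsilon/\gamma, 2\epsilon)$-closeness); the function name and the toy predicate in the usage example are hypothetical.

```python
def pick_almost_anchor_rows(loners, is_close):
    """Connect loners that are close in either direction; return one row per component."""
    parent = {row: row for row in loners}

    def find(row):
        # Follow parent pointers to the component root, halving paths along the way.
        while parent[row] != row:
            parent[row] = parent[parent[row]]
            row = parent[row]
        return row

    for a in loners:
        for b in loners:
            if a != b and (is_close(a, b) or is_close(b, a)):
                parent[find(a)] = find(b)

    representatives = {}
    for row in loners:
        # Keep the first loner encountered in each connected component.
        representatives.setdefault(find(row), row)
    return sorted(representatives.values())


# Toy usage: integer "rows" that count as close when they differ by at most 1.
anchors = pick_almost_anchor_rows([0, 1, 2, 5, 6], lambda a, b: abs(a - b) <= 1)
print(anchors)
```

The quadratic number of closeness tests is kept for clarity; each test is a polynomial-size linear program, so the whole step remains polynomial time.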

Now suppose we don't know $\gamma$. In this case the problem is that we don't know the right size
of neighborhood to look at. However, since we know $\gamma > 100\epsilon$, we shall first run the algorithm with
$\gamma = 100\epsilon$ to get $r$ rows $W'$ that are very close to the true rows in $W$. It is not hard to show that
these rows are at least $\gamma/2$ robustly simplicial and at most $\gamma + 2\epsilon$ robustly simplicial. Therefore we
can compute the $\gamma(W')$ parameter for this set of rows and use $\gamma(W') - 2\epsilon$ as the $\gamma$ parameter. ■

6 Maximum Likelihood Estimation is Hard 

Here we prove that computing the Maximum Likelihood Estimate (MLE) of the parameters of a
topic model is NP-hard. We call this problem the Topic Model Maximum Likelihood Estimation
(TM-MLE) problem:

Definition 6.1 (TM-MLE). Given $m$ documents and a target of $r$ topics, the TM-MLE problem
asks to compute the topic matrix $A$ that has the largest probability of generating the observed
documents (when the columns of $W$ are generated by a uniform Dirichlet distribution).

Surprisingly, this appears to be the first proof that computing the MLE in a topic
model is indeed computationally hard, although its hardness is certainly to be expected. On
a related note, Sontag and Roy [29] recently proved that given the topic matrix and a document,
computing the Maximum A Posteriori (MAP) estimate for the distribution on topics that generated
this document is NP-hard. Here we will establish that TM-MLE is NP-hard via a reduction from
the MIN-BISECTION problem: in MIN-BISECTION the input is a graph with $n$ vertices ($n$ is an
even integer), and the goal is to partition the vertices into two equal sized sets of $n/2$ vertices each
so as to minimize the number of edges crossing the cut.

Theorem 6.2. There is a polynomial time reduction from MIN-BISECTION to TM-MLE (r = 2). 

Proof: Suppose we are given an instance $G$ of the MIN-BISECTION problem with $n$ vertices and
$m$ edges. We will now define an instance of the TM-MLE problem. First, we set the number of
words to be $n$. For each word $i$, we construct $N = \lceil 200m^3 \log n \rceil$ documents each of which contains
the word $i$ twice and no other words. For each edge in the graph $G$, we construct a document
whose two words correspond to the endpoints of the edge.
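
The construction of the TM-MLE instance is mechanical, as the following minimal Python sketch shows. Documents are represented as pairs of word indices; the function name and the optional `copies` override (handy for small experiments) are illustrative assumptions, and the natural logarithm is assumed for $\log n$ since the base is not specified above.

```python
import math


def build_tm_mle_instance(num_vertices, edges, copies=None):
    """Return the two-word documents of the reduction for a graph on num_vertices vertices."""
    m = len(edges)
    if copies is None:
        # N = ceil(200 * m^3 * log n), with the natural logarithm assumed.
        copies = math.ceil(200 * m ** 3 * math.log(num_vertices))
    documents = []
    for word in range(num_vertices):
        # N documents that contain the word twice and nothing else.
        documents.extend([(word, word)] * copies)
    # One document per edge, made of the two endpoint words.
    documents.extend((u, v) for u, v in edges)
    return documents


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
small = build_tm_mle_instance(4, cycle, copies=3)
print(len(small))
```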

Suppose that $x = (x_1, x_2)$ is generated by the Dirichlet distribution $\mathrm{Dir}(1, 1)$. Consequently,
conditioned on $x$, the probability that words $i$ and $j$ appear in a document with only two words is exactly $(A^ix) \cdot (A^jx)$.
We can take the expectation of this term over the Dirichlet distribution $\mathrm{Dir}(1, 1)$ and hence the
probability that a document (with exactly two words) contains the words $i$ and $j$ is

$$\mathbb{E}_x\left[(A^ix)(A^jx)\right] = \frac{1}{3}\left(A^i_1A^j_1 + A^i_2A^j_2\right) + \frac{1}{6}\left(A^i_1A^j_2 + A^i_2A^j_1\right).$$

In the TM-MLE problem, our goal is to maximize the following objective function (which is the
log of the probability of generating the collection of documents):

$$\mathrm{OBJ} = \sum_{\text{document} = \{i,j\}} \log\left(\frac{1}{3}\left(A^i_1A^j_1 + A^i_2A^j_2\right) + \frac{1}{6}\left(A^i_1A^j_2 + A^i_2A^j_1\right)\right).$$

For any bisection, we define a canonical solution: the first topic is uniform on all words on one
side of the bisection and the second topic is uniform on all words on the other side of the bisection. To
prove the correctness of our reduction, a key step is to show that any candidate solution to the
MLE problem must be close to a canonical solution. In particular, we show the following:

1. The rows $A^i$ have almost the same $\ell_1$ norm.

2. In each row $A^i$, almost all of the weight will be in one of the two topics.

Indeed, canonical solutions have large objective value. Any canonical solution has objective
value at least $-Nn \log(3n^2/4) - m \log(3n^2/2)$ (this is because documents with the same words contribute
$-\log(3n^2/4)$ and documents with different words contribute at least $-\log(3n^2/2)$).

Recall, in our reduction $N$ is large. Roughly, if one of the rows has $\ell_1$ norm that is bounded
away from $2/n$ by at least $1/(20nm)$, the contribution (to the objective function) of documents with
a repeated word will decrease significantly and the solution cannot be optimal. To prove this we
use the fact that the function $\log x^2 = 2 \log x$ is concave. Therefore when one of the rows has
$\ell_1$ norm more than $2/n + 1/(20nm)$, the optimal value for documents with a repeated word will
be attained when all other rows have the same $\ell_1$ norm $2/n - 1/(20nm(n - 1))$. Using a Taylor
expansion, we conclude that the sum of terms for documents with a repeated word will decrease
by at least $N/(50m^2)$, which is much larger than any effect the remaining $m$ documents can recoup.
In fact, an identical argument establishes that in each row $A^i$, the topic with smaller weight will
always have weight smaller than $1/(20nm)$.

Now we claim that among canonical solutions, the one with largest objective value corresponds
to a minimum bisection. The proof follows from the observation that the value of the objective
function is $-Nn \log(3n^2/4) - k \log(3n^2/2) - (m - k) \log(3n^2/4)$ for canonical solutions, where $k$ is the
number of edges cut by the bisection. In particular, the objective function of the minimum bisection
will be at least an additive $\log 2$ larger than the objective function of a non-minimum bisection.
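
The closed form for canonical solutions can be confirmed on a small graph. The sketch below is a minimal illustration that evaluates the log-likelihood directly from the two-topic matrix, using the moments $\mathbb{E}[x_1^2] = \mathbb{E}[x_2^2] = 1/3$ and $\mathbb{E}[x_1x_2] = 1/6$ of $\mathrm{Dir}(1, 1)$; the function names, the 4-cycle and the small value of `copies` (standing in for $N$) are assumptions made for the example.

```python
import math


def pair_probability(row_i, row_j):
    """E[(A^i x)(A^j x)] for x ~ Dir(1, 1), with each row given as (topic 1, topic 2) weights."""
    same = row_i[0] * row_j[0] + row_i[1] * row_j[1]
    cross = row_i[0] * row_j[1] + row_i[1] * row_j[0]
    return same / 3.0 + cross / 6.0


def canonical_objective(n, edges, side, copies):
    """Log-likelihood of the reduction's documents under the canonical solution of a bisection."""
    weight = 2.0 / n  # each topic is uniform over the n / 2 words on its side
    rows = [(weight, 0.0) if side[v] == 0 else (0.0, weight) for v in range(n)]
    total = copies * sum(math.log(pair_probability(rows[v], rows[v])) for v in range(n))
    total += sum(math.log(pair_probability(rows[u], rows[v])) for u, v in edges)
    return total


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = canonical_objective(4, cycle, [0, 0, 1, 1], copies=3)  # cuts k = 2 edges
bad = canonical_objective(4, cycle, [0, 1, 0, 1], copies=3)   # cuts k = 4 edges
print(good - bad, 2 * math.log(2))
```

Each additional cut edge lowers the objective by exactly $\log 2$, so the two bisections of the 4-cycle differ by $2 \log 2$.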

However, even if the canonical solution is perturbed by $1/(20nm)$, the objective function will only
change by at most $m \cdot 1/(10m) = 1/10$, which is much smaller than $\log 2$. And this completes our
reduction. ■

We remark that the canonical solutions in our reduction are all separable, and hence this
reduction applies even when the topic matrix $A$ is known (and required) to be separable. So, even
in the case of a separable topic matrix, it is NP-hard to compute the MLE.

7 Conclusions 

We expect that versions of our algorithm may indeed be practical, and are investigating this
possibility. Our machine learning colleagues suggest that real-life topic matrices satisfy even stronger
separability assumptions, e.g., the presence of many anchor words per topic instead of a single one. 
This is a promising suggestion, but leveraging it in our algorithm is an open problem. 

Is separability necessary for allowing polynomial-time algorithms for the learning problems 
considered here? In other words, is the problem difficult if the topic matrix A is not separable? 
Average-case intractability seems more plausible here than NP-completeness.